
Implement muladd #9840

Merged · 2 commits · Jan 22, 2015
Conversation

@eschnett (Contributor)

(This branch is based on the fma branch which is likely to be merged soon #8112 . Since the implementations of fma and muladd are similar, there would otherwise be many conflicts.)

Implement muladd(x,y,z), a fast way to calculate x*y+z.

This is very different from fma(x,y,z), which also calculates x*y+z. fma is about accuracy: it guarantees that the intermediate result x*y is not rounded, which may be very slow on some platforms. muladd is guaranteed to be fast, and will use architecture-specific instructions if available. If fma happens to be the fastest way to perform this operation, then muladd will be equivalent to fma.
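As an illustrative sketch of the contract (not part of the patch itself; the values are arbitrary):

```julia
# muladd(x, y, z) computes x*y + z and is free to fuse the two operations
# into a single hardware instruction when that is fastest.
x, y, z = 1.0, 2.0, 3.0

# With or without fusion, the result may differ from x*y + z only by the
# rounding of the intermediate product x*y; for these inputs it is exact.
@assert muladd(x, y, z) == x*y + z
@assert muladd(x, y, z) == 5.0
```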

@simonbyrne (Contributor)

Fantastic, thanks.

@ivarne mentioned this pull request Jan 20, 2015
@@ -986,6 +986,20 @@ static Value *emit_intrinsic(intrinsic f, jl_value_t **args, size_t nargs,
ArrayRef<Type*>(x->getType())),
FP(x), FP(y), FP(z));
}
HANDLE(muladd_float,3)
#ifdef LLVM34
Member:
The API changed for only v3.4? Will this work on v3.3 and versions >= v3.5?

@eschnett (Contributor Author):
LLVM34 doesn't do what you probably think it does -- it means "LLVM 3.4 or later".

Member:
Ah yes, sorry for the noise.

simonbyrne added a commit that referenced this pull request Jan 22, 2015
@simonbyrne simonbyrne merged commit e726b6c into JuliaLang:master Jan 22, 2015
@simonbyrne (Contributor)

Thanks again, it's good to have this.

@eschnett eschnett deleted the muladd branch January 22, 2015 14:29
@stevengj (Member)

What about access to multiply-subtract and the other three FMA variants as discussed in #6330? See, for example, @evalpoly as discussed in #9881, which needs fnma.

@eschnett (Contributor Author)

These other variants are not exposed by LLVM as intrinsics. I assume that LLVM's optimizer will generate these automatically when it encounters nearby fneg instructions, in particular when they are marked as fast. muladd does not mark instructions as fast, but the @fastmath macro does.

@stevengj (Member)

Then why have muladd at all, as opposed to just using @fastmath?

@eschnett (Contributor Author)

@fastmath is much more aggressive than muladd. I would guess that using muladd where possible is harmless in almost all cases, and it helps get the full floating-point performance out of a CPU. The latter may not actually be possible when you write x*y+z.

On the other hand, @fastmath can lead to noticeable differences. For example, @fastmath also assumes that there are no Infs or NaNs. @fastmath leads to incorrect results if its assumptions are not met, whereas muladd may just provide a tiny bit more accuracy than you specified.

A more fine-grained set of options would be possible. For example, LLVM distinguishes between "assume there are no nans", "assume there are no infs", "don't care about the sign of zero", and "allow reciprocals instead of division". Another flag could be "flush subnormal numbers to zero". But such a large set of options would be quite confusing.
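A small sketch of this difference in contracts (the helper names here are made up for illustration): muladd keeps IEEE semantics for special values, whereas @fastmath is allowed to assume they never occur:

```julia
# muladd only licenses fusing the multiply and add; NaN and Inf still
# propagate according to the usual IEEE rules.
f_muladd(x, y, z) = muladd(x, y, z)

# @fastmath additionally assumes no NaNs/Infs and permits algebraic
# rewrites, so passing such values here gives no guarantees.
f_fast(x, y, z) = @fastmath x*y + z

@assert isnan(f_muladd(NaN, 1.0, 2.0))   # NaN propagates
@assert f_muladd(Inf, 1.0, 0.0) == Inf   # Inf propagates
```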

@simonbyrne (Contributor)

Last time I checked, LLVM was able to transform things like

fma(a,b,-c)

into the relevant fused multiply-subtract instruction.

@simonbyrne (Contributor)

Okay, on a Haswell machine running LLVM 3.5:

julia> foo(x,y,z) = muladd(x,y,-z)
foo (generic function with 1 method)

julia> @code_native foo(1.0,2.0,3.0)
    .text
Filename: none
Source line: 0
    push    rbp
    mov rbp, rsp
Source line: 1
    vfmsub213sd xmm0, xmm1, xmm2
    pop rbp
    ret

and

julia> foo(x,y,z) = muladd(-x,y,z)
foo (generic function with 1 method)

julia> @code_native foo(1.0,2.0,3.0)
    .text
Filename: none
Source line: 0
    push    rbp
    mov rbp, rsp
Source line: 1
    vfnmadd213sd    xmm0, xmm1, xmm2
    pop rbp
    ret

So it does look like it can handle the transformations correctly.

@stevengj (Member)

@simonbyrne, good to know that muladd + unary-minus is enough.
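For reference, the four fused variants from #6330 can then be spelled with muladd plus unary minus (the helper names are hypothetical, chosen to mirror the instruction mnemonics):

```julia
fmadd(a, b, c)  = muladd( a, b,  c)   #  a*b + c  → vfmadd…
fmsub(a, b, c)  = muladd( a, b, -c)   #  a*b - c  → vfmsub…
fnmadd(a, b, c) = muladd(-a, b,  c)   # -a*b + c  → vfnmadd…
fnmsub(a, b, c) = muladd(-a, b, -c)   # -a*b - c  → vfnmsub…

@assert fmsub(2.0, 3.0, 1.0)  ==  5.0
@assert fnmadd(2.0, 3.0, 1.0) == -5.0
```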

@simonbyrne (Contributor)

There is one downside however: on a non-fma machine (using LLVM 3.3)

julia> foo(x,y,z) = muladd(-x,y,z)
foo (generic function with 1 method)

julia> @code_native foo(1.0,2.0,3.0)
    .section    __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
    push    RBP
    mov RBP, RSP
    movabs  RAX, 4484106720
Source line: 1
    vxorpd  XMM0, XMM0, XMMWORD PTR [RAX]
    vmulsd  XMM0, XMM0, XMM1
    vaddsd  XMM0, XMM0, XMM2
    pop RBP
    ret

The generated code performs an explicit negation rather than using a subtraction (the other case is fine).

@stevengj (Member)

Yikes, that's unfortunate.

@simonbyrne (Contributor)

Ah, it does appear to be fixed with LLVM 3.5, so I guess that's not too big of a problem.

@stevengj (Member)

Is there some way to put in a regression test for this in case LLVM changes its mind again in the future?

@eschnett (Contributor Author)

That's difficult from Julia. We don't really care about what instructions LLVM generates, we care that it executes as fast as possible. Instruction selection (and register selection and instruction scheduling) isn't typically something that we worry about in Julia. I think this should be an LLVM test case instead, checking that the LLVM muladd intrinsic generates the right code.

I checked, and there doesn't seem to be such a check at the moment. This test ./CodeGen/X86/extended-fma-contraction.ll comes close, and could probably be adapted:

; RUN: llc -march=x86 -mcpu=bdver2 -mattr=-fma -mtriple=x86_64-apple-darwin < %s | FileCheck %s
; RUN: llc -march=x86 -mcpu=bdver2 -mattr=-fma,-fma4 -mtriple=x86_64-apple-darwin < %s | FileCheck %s --check-prefix=CHECK-NOFMA

; CHECK-LABEL: fmafunc
define <3 x float> @fmafunc(<3 x float> %a, <3 x float> %b, <3 x float> %c) {

; CHECK-NOT: vmulps
; CHECK-NOT: vaddps
; CHECK: vfmaddps
; CHECK-NOT: vmulps
; CHECK-NOT: vaddps

; CHECK-NOFMA-NOT: calll
; CHECK-NOFMA: vmulps
; CHECK-NOFMA: vaddps
; CHECK-NOFMA-NOT: calll

  %ret = tail call <3 x float> @llvm.fmuladd.v3f32(<3 x float> %a, <3 x float> %b, <3 x float> %c)
  ret <3 x float> %ret
}

declare <3 x float> @llvm.fmuladd.v3f32(<3 x float>, <3 x float>, <3 x float>) nounwind readnone

@StefanKarpinski (Member)

Currently Julia's own IR is represented as data, but as soon as we get to LLVM IR we just print a bunch of text. A great project would be to expose LLVM IR as data in Julia; that would make testing for this kind of thing possible, and possibly even straightforward.

@tkelman (Contributor) commented Jan 22, 2015

I've thought the same thing - #8275 (comment)

Having Julia data structures for LLVM IR (and assembly too?) could be pretty useful. If there's not an existing up-for-grabs issue, should we open one?

@StefanKarpinski (Member)

It might be better to stop at LLVM IR. As long as we only expose LLVM IR in Julia, it's hard to write inherently non-portable code, which is kind of nice. I guess if we wanted to reify machine code, each platform could have a different platform-specific representation. Please do open an issue!

@eschnett (Contributor Author)

Oooh, LLVM passes written in Julia!
